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Abstract 

This paper develops a theory for characterisation of DNA sequences based on their 
measure representation. The measures are shown to be random cascades generated by 
an infinitely divisible distribution. This probability distribution is uniquely determined 
by the exponent function in the multifractal theory of random cascades. Curve fitting 
to a large number of complete genomes of bacteria indicates that the Gamma density 
function provides an excellent fit to the exponent function, and hence to the probability 
distribution of the complete genomes. 



1 Introduction 

DNA sequences are of fundamental importance in understanding living organisms, since 
all information on their hereditary evolution is contained in these macromolecules. One of 
the challenges of DNA sequence analysis is to determine the patterns of these sequences. It is 
useful to distinguish coding from noncoding sequences. Problems related to the classification 
and evolution of organisms using DNA sequences are also important. 

Fractal analysis has proved useful in revealing complex patterns in natural objects. 
Berthelsen et al. considered the global fractal dimension of human DNA sequences 
treated as pseudorandom walks. Vieira || carried out a low-frequency analysis of the com- 
plete DNA of 13 microbial genomes and showed that their fractal behaviour does not always 
prevail through the entire chain and the autocorrelation functions have a rich variety of 



behaviours including the presence of anti-persistence. Provata and Almirantis [14] proposed 
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a fractal Cantor pattern of DNA. They mapped coding segments to filled regions and non- 
coding segments to empty regions of a random Cantor set and then calculated the fractal 
dimension of this set. They found that the coding/noncoding partition in DNA sequences of 
lower organisms is homogeneous-like, while in the higher eucariotes the partition is fractal. 
Yu and Anh |16| proposed a time series model based on the global structure of the complete 
genome and found that one can get more information from this model than that of fractal 
Cantor pattern. Some results on the classification and evolution relationship of bacteria were 
found in [ 16| . The correlation property of length sequences was discussed in [[HJ . 

Although statistical analysis performed directly on DNA sequences has yielded some 
success, there has been some indication that this method is not powerful enough to amplify 
the difference between a DNA sequence and a random sequence as well as to distinguish DNA 
sequences themselves in more details. One needs more powerful global and visual methods. 
For this purpose, Hao et al. [0] proposed a visualisation method based on coarse-graining 
and counting of the frequency of appearance and strings of a given length. They called it the 
portrait of an organism. They found that there exist some fractal patterns in the portraits 
which are induced by avoiding and under-represented strings. The fractal dimension of the 
limit set of portraits was discussed in [18|, ||. There are other graphical methods of sequence 
patterns, such as chaos game representation (see JT0|, ||). 

In the portrait representation, Hao et al. used squares to represent substrings and 
discrete colour grades to represent the frequencies of the substrings in the complete genome. 
It is difficult to know the accurate value of the frequencies of the substrings from the portrait 
representation. And they did not discuss the classification and evolution problem. In order 
to improve it, Yu et al [15| used subintervals in one-dimensional space to represent substrings 
to obtain an accurate histogram of the substrings in the complete genome. The histogram, 
viewed as a probability measure and was called the measure representation of the complete 
genome, gives a precise compression of the genome. Multifractal analysis was then proposed 
in Yu et al |15| to treat the classification and evolution problem based on the measure 
representation of different organisms. 

In this paper, we go one step further and provide a complete characterisation of the 
DNA sequences based on their measure representation. This is given in the form of the 
probability density function of the measure. We first show that the given measure is in fact 
a multiplicative cascade generated by an infinitely divisible distribution. This probability 
distribution is uniquely determined by the exponent K (q) , q > 0, in the multifractal analysis 
of the cascade. This theory will be detailed in the next section. We then apply the theory 
on a large number of typical genomes. It will be seen that the Gamma density function 
provides an excellent fit to the K (q) curve of each genome. This characterisation therefore 
provides a needed tool to study the evolution of organisms. 
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2 Measure representation of complete genome 



We first outline the method of Yu et al |T5| in deriving the measure representation of a 
DNA sequence. Such a sequence is formed by four different nucleotides, namely adenine (a), 
cytosine (c), guanine (g) and thymine (t). We call any string made of K letters from the 
set {g,c,a,t} a if-string. For a given K there are in total 4 K different .fT-strings. In order 
to count the number of each kind of i^-strings in a given DNA sequence, 4 K counters are 
needed. We divide the interval [0, 1[ into 4 K disjoint subintervals, and use each subinterval 
to represent a counter. Letting s = si • • • sk, Sj £ {a, c, g, t}, i = 1, • • • , K, be a substring 
with length K, we define 
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We then use the subinterval [xi(s),x r (s)) to represent substring s. Let N(s) be the times 
of substring s appearing in the complete genome. If the number of bases in the complete 
genome is L, we define 



F(s) = N(s)/(L-K + l) 

to be the frequency of substring s. It follows that J2{ s } F(. s ) 
measure fix on [0, 1) by 



(2.4) 

1. Now we can define a 



Hk (dx) = Yk (x) dx, 



where 



Y K (x) = A K F K (s), x e [xi(s),x r (s)). 



(2.5) 



We then have hk ([0, 1)) = 1 and hk ([ x i{s),x r (s))) = F K (s). We call Hk (x) the measure 
representation of an organism. As an example, the measure representation of M. genitalium 
for K — 3, 8 is given in Figure 1. Self-similarity is apparent in the measures. 

More than 33 bacterial complete genomes are now available in public databases. There 
are six Archaebacteria (Archaeoglobus fulgidus, Pyrococcus abyssi, Methanococcus jannaschii, 
Pyrococcus horikoshii, Aeropyrum pernix and Methanobacterium thermoautotrophicum); five 
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Gram-positive Eubacteria (Mycobacterium tuberculosis, Mycoplasma pneumoniae, Mycoplasma 
genitalium, Ureaplasma urealyticum, and Bacillus subtilis). The others are Gram-negative 
Eubacteria, which consist of two Hyperthermophilic bacteria (Aquifex aeolicus and Ther- 
motoga maritima); five Chlamydia (Chlamydia trachomatisserovar, Chlamydia muridarum, 
Chlamydia pneumoniae and Chlamydia pneumoniae AR39); two Spirochaete (Borrelia burgdor- 
feri and Treponema pallidum); one Cyanobacterium (Synechocystis sp. PCC6803); and 
thirteen Proteobacteria. The thirteen Proteobacteria are divided into four subdivisions, 
which are alpha subdivision (Rhizobium sp. NGR234 and Rickettsia prowazekii); gamma 
subdivision (Escherichia coli, Haemophilus influenzae, Xylella fastidiosa, Vibrio cholerae, 
Pseudomonas aeruginosa and Buchnera sp. APS); beta subdivision (Neisseria meningi- 
tidis MC58 and Neisseria meningitidis Z2491); epsilon subdivision (Helicobacter pylori J99, 
Helicobacter pylori 26695 and Campylobacter jejuni). The complete sequences of some chro- 
mosomes of higher organisms are also currently available. We selected the sequences of 
Chromosome 15 of Saccharomyces cerevisiae, Chromosome 3 of Plasmodium falciparum, 
Chromosome 1 of Caenorhabditis elegans, Chromosome 2 of Arabidopsis thaliana and Chro- 
mosome 22 of Homo sapiens. 

In our previous work (Yu et al [15|), we calculated the numerical dimension spectra D q 
(defined in next section) for all above organisms and for different K. For small K, there are 
only a few different -K"-strings, so there is not enough information for any clear-cut result. 
We find that the D q curves are very close to one another for K — 6, 7, 8 for each organism. 
Hence it would be appropriate to take K = 8 if we want to use the D q curves to discuss 
the classification and evolution problem. It is still needed to know what is the analytical 
expression of the dimension spectra. The main aim of this paper is to establish a theoretical 
model to give such an analytical expression. 



3 Multifractal models 

Let e (t) be a positive stationary stochastic process on a bounded interval of R, assumed 
to be the unit interval [0, 1] for convenience, with Ee (t) = 1. The smoothing of e (t) at scale 
r > is defined as 



1 rt+r/2 

e r (t) = - / e(s)ds. (3.1) 

r Jt-r/2 



For 0<r<w<f, we consider the processes 



*r,v (t) = t e [0, 1] . 



Following Novikov [TjJ, we assume the following scale invariance conditions: 

(i) The random variables X T)U and X U)V are independent; 

(ii) The probability distribution of each random variable X UyV depends only on the ratio 
u/v of the corresponding scales. 
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These conditions imply the power-law form for the moments of the processes X u v if they 
exist. In fact, we may write 

E(X u , v (t)f = g q (^y q>0 (3.2) 

from condition (ii) for some function g which also depends on q. From the identity 

X r>v (t) = X TiU (t) X UjV (t) 

and condition (i) we get 

9* (~) = 9, ( L ) 9, (-) ■ (3-3) 



Since u is arbitrary, we then have 



,v J \v 

for some function K (q) with K (0) = 0. It follows that 
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Writing Y for X r>v we obtain 
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Since 

{E{Y q lnY)f = (E(Y q/2 Y q/2 \YiYyf 

< (EY q ) E (Y q (lnF) 2 ) (3.5) 

by Schwarz's inequality and v/r > 1, we get K" (g) > 0, that is, K (q) is a convex function. 
It is noted that equality holds in ( |3.5| ) only if K (q) is a linear function of q; other than this, 
K (q) is a strictly convex function. 

For < q < 1, we assume that K (q) < 0, which reflects the fact that, in this range, 
taking a gth-power necessarily reduces the singularity of X U)V . Also, we assume that the 
probability density function of X u yV is skewed in the positive direction. This yields that 
K (q) > for q > 1. These assumptions, in conjunction with the strict convexity of K (q) , 
suggest the assumption that 

K(1)=0. (3.6) 



In (v/r) 
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This implies that 

EX U ^ V = 1 for arbitrary < u < v . (3.7) 

In this paper, we will consider smoothing at discrete scales tj = 2~i +1 , j = 0, 1,2,3, ... 
Then the smoothed process at scale rj is 

1 rt+2-i 

Xj (t) = e rj (t) = — jf_ 2 _ . e (s) ds. (3.8) 

Under the condition Ee (t) = 1, it is reasonable to assume that 

X (t) =1, t G [0,1]. (3.9) 

Then, at generation J, 

X, (t) X 2 (t) Xj (t) 



Xj(t) = X (t) 



x (t) x 1 {t) ■ Xj_ x it) 

X 1 (t)X 2 (t) Xj (t) 

x (t)x 1 (t) "■'x J _ 1 (t)' 



(3.10) 



Under the scale invariance conditions (i) and (ii), the random variables Xj/Xj_i of (|3.10|) are 
independent and have the same probability distribution. Let W denote a generic member of 
this family. Note that EW = 1 from (|3.7|) . Then (|3.10j) can be rewritten as 

Xjit) = Xj^(t) x,{l] 



Xj-! (t) 

= W 1 (t)W 2 (t)...W J (t), te [0,1]. (3.11) 

In other words, Xj (t) is a multiplicative cascade process (see Holley and Waymire 0, Gupta 
and Waymire 0). Denote by \xj the sequence of random measures defined by the density 
Xj (t) , that is, 

fij(dt)=Xj(t)dt, J =1,2,3,... 

It can be checked that \xj a.s. has a weak* limit since for each bounded continuous func- 
tion / on [0, 1] , the sequence Jj 0jl ] fdfij is an Li-bounded martingale (see Holley and Waymire 
@ , Mandelbrot [Q , Kahane and Peyriere fill ) . We denote the density corresponding to /x^ 
by (t) . Then it is seen from ( |3.8|) that 

X 00 (t)=e(t), te[0,l]. (3.12) 

Summarising, we have established that 

The positive stationary process e (t) is the limit of a 
multiplicative cascade with generator W . 
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We next want to characterise this random cascade. We first note that, for j = 1, 2, 3 

rt+2-U-K 



X~ - 2 rt+2-u-v ' ^ 2 ( 3 - 13 ) 



from the positivity of e (i) . Thus, 



This inequality together with (|3.4j) imply 

K(q)<q, q>0. (3.14) 

We then have 



/ Y \ 2q \ ~~ q 00 /1 \ ^ 



oo. 



In other words, the Carleman condition is satisfied (see Feller M, p. 224). As a result, we 
get 

The probability density function fw of the generator W is 
uniquely determined by the set {K (q) , q = 0, 1, 2, ...} . 

It is seen that, if the function K (q) has analytic continuation into the complex plane, 
then the characteristic function of In W has the form 

^{x)=E(e M ) = {^ . (3.15) 

Define xp n (x) = (1/2 /n J for an arbitrary integer n. Then ip n is the characteristic 

function of the probability distribution corresponding to smoothing with scales (2 l l n ^j J 
Also, it holds that 

lp(x) = (lp n (x)) n . 

Thus ip (x) is infinitely divisible (see Feller |4[], p. 532); in other words, 

InW has an infinitely divisible distribution. (3.16) 



It is noted from ( |3.13| ) that — In ^ > 0. The most general form for the characteristic function 



if (x) of positive random variables is given by 



poo e *xs 1 

if (x) = exp < J P(ds)+iax>, (3-17) 
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where a > and P is a measure on the open interval (0, oo) such that / °° (1 + s) 1 P (ds) < 
oo (see Feller p. 539). On the other hand, it follows from ( |3.2|) and ( |3.4| ) that the 



characteristic function of — In 2- is given by 



2 ix E(Wy ix 



(3.18) 



Using q 



-ix 



and equating ( |3.17| ) with ( |3.18| ) then yields 



K(q) 



q 



P{ds) 



(3.19) 



As constrained by (|3 



In 2/ Jo s ln2 
the following condition must be satisfied by the measure P (ds) : 
r°° 1 - e~ s P (ds) a 



In 2 



In 2 



< 1. 



(3.20) 



Equations ( |3.19| ) and ( p. 20 ) provide the most general form for the K (q) curve of the positive 
random process {e (t) , < t < 1} . 

In practice, fitting this K (q) curve to data requires a proper choice of the measure P (ds) . 



Novikov |T3| suggests the use of the Gamma density function, namely, 

/ (x) = Ax*' 1 exp (-x/a) , (3.21 
where P (dx) = f (x) dx and A, a, a are positive constants. From ( |3.19| ) and ( |3.21 

« (q 



K(q) 



(gq+l) 1 -"-! 



where k 



K \q 

a I In 2, and from ( |3.20| ) we have 
Kln2 



ln( g q+l) \ 
ln( CT +l) J ' 



a = 1. 



we get 
(3.22) 



A 



a 



Q-l 



Tia-V 



(l-(a + iy~ a )- L . 



The form ( |3.22|) will be used for data fitting in this paper. It is seen from ( |3.2|) and ( |3T4] 
that the data for the K (q) curve is provided by 



K (q) = lim 



InE(Xj) 



J^oo -ln2~ J+1 



(3.23) 



where it should be noted from ( |3.12| ) that (t) = e (t) , the given positive random process. 

Since each smoothed process Xj may possess long-range dependence (see Anh et al. [j]), 
the ergodic theorem may not hold for these processes. As a result, the computation of 
E (Xj) as sample averages may not be sufficiently accurate. There is an alternative form 
of the ergodic theorem developed by Holley and Waymire B for random cascades which we 



now summarise. 
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For random cascades with density e (t) , limit measure /j,^, branching number b and 
generator W, define 

M J (g) = E'(^oo(A fe J )) g , (3.24) 

k 

lnMj(g) 

t W = ) im n 7 » (3-25) 



I> fl = T(g)/(g-l), 



(3.26) 



Xfe (g)=log 6 £(W«)-( g -l) 



(3.27) 



where the prime in ( 3.24j ) indicates a sum over those subintervals of generation J which 
meet the support of fi^. 

Theorem 1 (Holley and Waymire f^J) Assume that W > a for some a > and W < b with 
probability 1, and that E (W 2q ) / (EW q ) 2 < b. Then, with probability 1, 



t (<?) = ~Xb (?) • 



(3.28) 



In our case as developed above, 6 = 2, and ( fj.l3| ) gives W < 2. In fact the scale rj = 2~ J+1 
used in ( |3.8|) is arbitrary; it can be 6~- J+1 and the inequality W < b still holds by definition 
of the smoothing and the positivity of e (t) . In our development, 

w j^oo In2- J+1 

J\nE(W q ) 



lim 

J-oo (J - l)ln2- 1 
\nE(W q ) 
hi~2 ' 



using ( p. 11 



Consequently, 



K(q) = -r(q) + q-l. 



(3.29) 



The above formula then provides a way to compute K (q) via ( |3.25|) and ( |3.29| ) using sums 
of q-th powers of the limit measure instead of ( |3.23|) using expectations. In fact, the ergodic 
theorem now takes the following form 



InEiXj) 



lim . , 
j^oo(J-l)ln2 J^oo 



hm — + q-l. 



J In 2 
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4 Data fitting and discussion 



For K = 8, we first calculated K(q) of the measure representation of all the above 
organisms directly from the definition of K(q) ( |3.23|) . Figure 2 shows how to calculate this 
K(q) curve. We give the K(q) curves of E. coli, S. cerevisiae Chrl5, C. elegans Chrl, A. 
thaliana Chr2, and Homo sapiens Chr22 in Figure 3. From Figure 3, it is seen that the 
grade of the organism is lower when the K q curve is flatter. Hence the evolution relationship 
of these organisms is apparent. We denote by Kd{q) the value of K(q) computed from the 



data using its definition ( |3.23|) and define 



error = 

3=1 



{q 3 a + lf- a -I, 



§ K * " (a + iy--l } " Kd ^ 



2 



Then the values of re, a and a can be estimated through minimising error. In this minimi- 



sation, we assume 



< re, a, a < 20. 

After obtaining the value of re, a and a, we then get the K(q) curve from ( |3.22| ). The 
data fitting based on the form ( |3.22| ) was performed on all the organisms and shown in 
Table 1 (from top to bottom, in the increasing order of the value of re). It is found that 
the form 22 ) gives a perfect fit to the data for all bacteria. As an example, we give the 



data fitting of E. coli, S. cerevisiae Chrl 5 and C. elegans Chrl in Figure 4. But for higher 
organisms, for example, Homo sapiens Chromosome 22, the fitting is not as good. Note 
that we only selected one chromosome for each higher organism. If all chromosomes for each 
higher organism are considered, the data fitting for K q will be better. The fit for Human 
Chromosome is the worst in Table 1. Since the length of Human Chromosome 22 is not 
larger than those of the complete genomes of all bacteria, there does not seem to be any 
relationship between the quality of fit and the length of the complete genome. 

The parameter re provides a tool to classify bacteria. From Table 1, one can see He- 
licobacter pylori 26695 and Helicobacter pylori J99 group together, and three Chlamydia 
almost group together. But this parameter re alone is not sufficient, it must be combined 
with other tools to classify bacteria. 

We also calculated the values of r(q) using its definition ( |3.25[) . We found the values of 



Kd(q) coincide with those obtained from ( |3.29|) . Hence we indeed can use ( p.29|) to calculate 



K(q). Formula ( |3.22j ) gives an analytical expression for the quantity K q . An analytical 
expressions for r(q) can therefore be obtained from ( |3.29| ) and D q from ( |3.26| ). 



5 Conclusions 

The idea of our measure representation is similar to the portrait method proposed by Hao 
et al. [0]. It provides a simple yet powerful visualisation method to amplify the difference 
between a DNA sequence and a random sequence as well as to distinguish DNA sequences 
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themselves in more details. From our measure representation we can exactly know the 
frequencies of all the if-string appearing in the complete genome. But the representations 
alone are not sufficient to discuss the classification and evolution problem. Hence we need 
further tools. 

In our previous work (Yu et al ||15||), when the measure representations of organisms were 
viewed as time series, it was found that they are far from being random time series, and in 
fact exhibit strong long-range dependence. Multifractal analysis of the complete genomes 
was performed in relation to the problem of classification and evolution of organisms. In 
this paper, we established a theoretical model of the probability distribution of the complete 
genomes. This probability distribution, particularly the resulting K(q) curve, provides a 
precise tool for their characterisation. Numerical results confirm the accuracy of the method 
of this paper. 

For a completely random sequence based on the alphabet {a,c,g,t}, we have D q = 
1, r(q) = q — 1, K{q) = for all q. From the K(q) curves, it is seen that all complete 
genomes selected are far from being a completely random sequence. 
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Table 1: The values of k, a, a, error of all the organisms selected. 
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Figure 1: Histograms of substrings with different lengths 
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Figure 2: An example to show how to obtain the value of K{q) directly using its definition. 
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Figure 3: The values of K (q) of Chromosome 22 of Homo sapiens, Chromosome 2 of A. thaliana, Chromo- 
some 1 of C. elegans, Chromosome 15 of S. cerevisiae and E. coli. 
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Figure 4: The data fitting of E. coli, chromosome 15 of S. cerevisiae and chromosome 1 of C. elegans based 
on the Gamma model. The symbolled curves represent Kd (q) computed from data, while the continuous 
curves represent K (q) computed from formula ( 3 . 22| ) . 
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